Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from noisy, structured, and unstructured data, and to apply that knowledge across a broad range of application domains. Applications include healthcare, gaming, recommendation systems, logistics, fraud detection, internet search, targeted advertising, speech recognition, airline route planning, and many more. In today's world, where the volume of data is growing exponentially, there is an urgent need for good data scientists who can derive meaningful information from these huge sources of data.
In this exploratory data analysis, we analyze a dataset of data science salaries from 2020 to 2022, covering various job titles and distinct geographical regions. We aim to derive some information about how the features relate to, or correlate with, salary.
The dataset consists of 11 features, with salary records for 606 candidates. The features are described below:
| Column | Description |
|---|---|
| work_year | The year the salary was paid. |
| experience_level | The experience level in the job during the year: EN (Entry-level / Junior), MI (Mid-level / Intermediate), SE (Senior-level / Expert), EX (Executive-level / Director). |
| employment_type | The type of employment for the role: PT (Part-time), FT (Full-time), CT (Contract), FL (Freelance). |
| job_title | The role worked in during the year. |
| salary | The total gross salary amount paid. |
| salary_currency | The currency of the salary paid, as an ISO 4217 currency code. |
| salary_in_usd | The salary in USD (FX rate divided by avg. USD rate for the respective year via fxdata.foorilla.com). |
| employee_residence | Employee's primary country of residence during the work year, as an ISO 3166 country code. |
| remote_ratio | The overall amount of work done remotely: 0 (no remote work, less than 20%), 50 (partially remote), 100 (fully remote, more than 80%). |
| company_location | The country of the employer's main office or contracting branch, as an ISO 3166 country code. |
| company_size | The average number of people that worked for the company during the year: S (fewer than 50 employees, small), M (50 to 250 employees, medium), L (more than 250 employees, large). |
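The coded values in the table can be turned into readable labels before plotting. The snippet below is a minimal sketch of that step: the lookup dictionaries are transcribed from the column descriptions, while the `sample` rows and the variable names are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Code-to-label lookups transcribed from the column descriptions.
EXPERIENCE_LABELS = {
    "EN": "Entry-level / Junior",
    "MI": "Mid-level / Intermediate",
    "SE": "Senior-level / Expert",
    "EX": "Executive-level / Director",
}
EMPLOYMENT_LABELS = {
    "PT": "Part-time",
    "FT": "Full-time",
    "CT": "Contract",
    "FL": "Freelance",
}
REMOTE_LABELS = {0: "No remote work", 50: "Partially remote", 100: "Fully remote"}
COMPANY_SIZE_LABELS = {"S": "Small", "M": "Medium", "L": "Large"}

# Hypothetical rows, for illustration only (not taken from the dataset).
sample = pd.DataFrame({
    "experience_level": ["EN", "SE", "EX"],
    "employment_type": ["FT", "CT", "FT"],
    "remote_ratio": [0, 50, 100],
    "company_size": ["S", "M", "L"],
})

# Series.map replaces each code with its label; unknown codes become NaN.
decoded = sample.assign(
    experience_level=sample["experience_level"].map(EXPERIENCE_LABELS),
    employment_type=sample["employment_type"].map(EMPLOYMENT_LABELS),
    remote_ratio=sample["remote_ratio"].map(REMOTE_LABELS),
    company_size=sample["company_size"].map(COMPANY_SIZE_LABELS),
)
print(decoded)
```

Keeping the original codes in the dataframe and decoding only a copy makes it easy to group by the compact codes while still showing full labels in plots.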
The experience_level column contains 280 Senior-level entries (approximately 46%), 213 Mid-level entries (approximately 35%), 88 Entry-level entries (approximately 14.5%), and 26 Executive-level entries (approximately 4.3%).
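The percentages above follow directly from the counts. As a small worked sketch, the counts quoted in the text are placed in a pandas Series and converted to shares; the variable names are illustrative.

```python
import pandas as pd

# Counts per experience level, as reported in the text.
counts = pd.Series({"SE": 280, "MI": 213, "EN": 88, "EX": 26}, name="count")

# Share of each level as a percentage of all entries, rounded to one decimal.
share = (counts / counts.sum() * 100).round(1)

print(pd.DataFrame({"count": counts, "percent": share}))
```

On the full dataframe, `value_counts(normalize=True)` on the experience_level column would give the same shares as fractions without computing the counts by hand.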
Questions:
There are 50 unique job titles in the job_title column. In the word cloud, the most frequent titles (Data Scientist, Data Analyst, Data Engineer, Machine Learning Engineer, Data Science Manager, Data Analytics Manager, Big Data Engineer, Machine Learning Scientist, etc.) appear larger than the others, based on their frequency of occurrence. Let's narrow this down to the top 10 job titles.
The top 10 job titles, from the plot above, are: Data Scientist, Data Engineer, Data Analyst, Machine Learning Engineer, Research Scientist, Data Science Manager, Data Architect, Big Data Engineer, Machine Learning Scientist, and Data Analytics Manager, in that order.
From the plot above, we see that medium (M) companies, those with 50 to 250 employees, are the most frequent in the dataframe. They are followed by large (L) companies, with more than 250 employees, and then by small (S) companies, with fewer than 50 employees.
Let's investigate the distribution of employee residence.
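A ranking like this can be obtained from frequency counts. The following is a minimal sketch on a hypothetical job_title column (the rows are made up, and `top_n` is set to 2 only to keep the example small); the notebook's actual code is not shown in the source.

```python
import pandas as pd

# Hypothetical job_title column, for illustration only.
df = pd.DataFrame({
    "job_title": [
        "Data Scientist", "Data Engineer", "Data Scientist", "Data Analyst",
        "Data Engineer", "Data Scientist", "Machine Learning Engineer",
    ]
})

# Number of distinct titles in the column.
n_unique = df["job_title"].nunique()

# value_counts sorts by frequency in descending order, so head(top_n)
# returns the most frequent titles first.
top_n = 2
top_titles = df["job_title"].value_counts().head(top_n)

print(n_unique)
print(top_titles)
```

With the real dataframe, setting `top_n = 10` would produce the list used for the plot.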
Question:
From the plot above, we see that the top 10 countries of employee residence are: United States of America (US), Great Britain (GB), India (IN), Canada (CA), Germany (DE), France (FR), Spain (ES), Greece (GR), Japan (JP), and Pakistan (PK), in that order.
Question:
The top 10 company locations are: United States of America (US), Great Britain (GB), Canada (CA), Germany (DE), India (IN), France (FR), Spain (ES), Greece (GR), Japan (JP), and Austria (AT).
We see an upward trend in the number of jobs from 2020 to 2022, with each year accounting for a larger percentage of the listings than the year before.
Figure: Salary (USD) Distribution
Salaries (in USD) are concentrated between roughly 63k and 150k, with a median of approximately 102k. The histogram shows that the distribution is right-skewed. The box plot shows that some salaries lie above the upper fence (276k). Let's investigate these salaries further.
From the plot, we see that 380 jobs are labeled Fully Remote (approx. 63% of listings), 130 are labeled Partially Remote (approx. 21%), and 99 are labeled No Remote (approx. 16%).
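For reference, the upper fence of a box plot is conventionally computed as Q3 + 1.5 × IQR. The sketch below assumes that convention (the source does not state which whisker rule the plot uses) and applies it to hypothetical salary values; `upper_fence` is an illustrative helper, not a library function.

```python
import pandas as pd

def upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, the conventional upper fence of a box plot."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)

# Hypothetical salaries in USD, for illustration only.
salaries = pd.Series(
    [40_000, 60_000, 80_000, 100_000, 120_000, 140_000, 160_000, 400_000]
)

fence = upper_fence(salaries)

# Values above the fence are the points drawn as individual dots.
outliers = salaries[salaries > fence]

print(fence)
print(outliers.tolist())
```

Filtering the dataframe with the same condition is one way to list the salaries above the fence for closer inspection.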
We see that the average salary differs across company sizes. From the plots above, the mean salary for a job listing is approximately 77,630 USD at small companies, 116,910 USD at medium companies, and 119,240 USD at large companies, while the median salaries are 65,000 USD (small), 113,188 USD (medium), and 100,000 USD (large).
While this is an interesting insight into the mean and median salary by company size, let's investigate these salaries further with respect to experience level.
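Mean and median per group can be computed in one pass with a grouped aggregation. This is a minimal sketch on hypothetical rows; the column name `salary_in_usd` is assumed from the feature table, and the values are invented to show how a single high salary pulls the mean above the median.

```python
import pandas as pd

# Hypothetical listings, for illustration only.
df = pd.DataFrame({
    "company_size": ["S", "S", "S", "M", "M", "M", "L", "L", "L"],
    "salary_in_usd": [
        50_000, 60_000, 130_000,    # small: one high value raises the mean
        100_000, 110_000, 120_000,  # medium: symmetric, mean equals median
        90_000, 100_000, 170_000,   # large: one high value raises the mean
    ],
})

summary = (
    df.groupby("company_size")["salary_in_usd"]
      .agg(["mean", "median"])
      .reindex(["S", "M", "L"])  # order rows from small to large
)

print(summary)
```

A gap between the two columns for a group is a quick sign that its salaries are skewed.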
Looking at the plots, we see a progressive increase across the experience levels, from Entry_level/Junior to Executive_level/Director. From the histogram and box plots, the median salaries for the various experience levels are:
Also, the distribution for each experience level is right-skewed and contains probable outliers, which are visible as the dots to the right of each box plot.
Question:
From the plot, we see a general upward movement of the mean salary from Entry_level/Junior to Executive_level/Director across the company sizes, except for small companies, where the Mid_level/Intermediate mean salary is lower than the Entry_level/Junior mean. This is quite unexpected and needs to be investigated further.
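One possible first step for such an investigation is to look at group sizes next to the group means, since a mean based on very few listings can be misleading. The sketch below uses hypothetical rows and the assumed column name `salary_in_usd`; it is a suggestion, not the analysis performed in the source.

```python
import pandas as pd

# Hypothetical listings, for illustration only.
df = pd.DataFrame({
    "company_size":     ["S", "S", "S", "M", "M", "M", "M"],
    "experience_level": ["EN", "EN", "MI", "EN", "EN", "MI", "MI"],
    "salary_in_usd":    [60_000, 70_000, 50_000, 55_000, 65_000, 80_000, 90_000],
})

# Mean salary and number of listings for each (size, level) pair.
stats = (
    df.groupby(["company_size", "experience_level"])["salary_in_usd"]
      .agg(["mean", "count"])
)

# Wide table of means: one row per company size, one column per level.
mean_table = stats["mean"].unstack("experience_level")

print(stats)
print(mean_table)
```

In this made-up data the small-company Mid-level mean rests on a single listing, which is the kind of pattern worth checking for in the real data.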
Question:
Interestingly, between 2020 and 2022, Remote job listings have the highest mean salary, followed by No-remote listings, with Hybrid listings having the lowest mean salary.
Question:
Although Data Scientist, Data Engineer, and Data Analyst are the three most frequent job titles, they are not the highest paid among the top 10. Among the top 10 job titles (by frequency of occurrence in the dataframe), those with the highest mean salary are Director of Data Science, with the highest average salary of 195,074 USD, followed by Data Architect (approximately 177,874 USD) and Data Science Manager (approximately 158,329 USD).
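Comparing pay among the most frequent titles takes two steps: pick the titles by frequency, then average salaries within that subset. The following is a minimal sketch under stated assumptions: the rows, the title "Head of Data", and `top_n = 3` are hypothetical, and `salary_in_usd` is the column name assumed from the feature table.

```python
import pandas as pd

# Hypothetical listings, for illustration only.
df = pd.DataFrame({
    "job_title": (
        ["Data Scientist"] * 3
        + ["Data Engineer"] * 2
        + ["Data Architect"] * 2
        + ["Head of Data"]
    ),
    "salary_in_usd": [
        100_000, 110_000, 120_000,  # Data Scientist
        95_000, 105_000,            # Data Engineer
        170_000, 180_000,           # Data Architect
        230_000,                    # Head of Data (rare title)
    ],
})

# Step 1: the top_n most frequent titles.
top_n = 3
frequent_titles = df["job_title"].value_counts().head(top_n).index

# Step 2: mean salary within those titles only, highest first.
mean_by_title = (
    df[df["job_title"].isin(frequent_titles)]
      .groupby("job_title")["salary_in_usd"]
      .mean()
      .sort_values(ascending=False)
)

print(mean_by_title)
```

Filtering first keeps rare but highly paid titles from dominating the ranking, which is why the most frequent title is not necessarily at the top.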
Next, let's find out what the top 10 job titles with the highest mean salary by experience level are.
Although Mid_level/Intermediate and Entry_level/Junior experience appear in the plot, the job titles with the top 10 average salaries mainly require Executive_level/Director or Senior_level/Expert experience.
Next, let's see what the median salaries are for the top 10 job titles at the Entry_level/Junior experience level.
The top 10 mean salary plot shows that the top three job titles at the Entry-level/Junior experience level are not the highest-paid ones; rather, Machine Learning Engineer, Research Scientist, and Business Data Analyst are the three job titles with the highest mean salaries.
Analysis of this data gives us the following information about the Data Science field:
In conclusion, pursuing a career in a Data Science related job is a very good choice, with tremendous opportunities for advancement in the future. Demand is already high, salaries are competitive, and the perks are numerous. No wonder Harvard Business Review called it the "sexiest job of the 21st century" in 2012.